March 22, 2005

Why multivariate analysis?

Landscape of tools

What we will cover today

Function for plotting
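The `color_for_species()` helper used in the plots below is not reproduced in this extract; a minimal stand-in (an assumption, not the original code) could look like:

```r
# hypothetical stand-in for the color_for_species() helper used below:
# map each iris species to a base-R plotting color
color_for_species <- function(species) {
  palette <- c(setosa = "tomato", versicolor = "steelblue", virginica = "seagreen")
  palette[as.character(species)]
}

plot(iris$Petal.Length, iris$Petal.Width, col = color_for_species(iris$Species))
```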

The iris dataset

head(iris, 4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa

From left to right: I. setosa, I. versicolor, I. virginica.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA)

PCA is a tool that compresses the information in a data table into a smaller, more manageable space: the PC space.

The PC space is a new coordinate system whose axes (the principal components) are ordered so that each captures as much of the remaining variation in the data as possible.

PCA principles

# scale and center table
d <- as.data.frame(scale(iris[, 1:4], center = TRUE, scale = TRUE))

# covariance matrix
covariance_matrix <- cov(d)

# eigen-decomposition of the covariance matrix
e <- eigen(covariance_matrix)

# eigenvalues: the variance captured by each principal component
lambdas <- e$values
importance_principal_components <- lambdas / sum(lambdas)

# eigenvectors (returned with unit length): the principal component directions
principal_components <- e$vectors
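The final step, projecting the flowers into PC space, follows directly from the eigenvectors. A minimal sketch of how the scores relate to `prcomp()$x` (equal up to arbitrary column signs):

```r
# project the scaled data onto the eigenvectors to obtain the PC scores
d <- scale(iris[, 1:4], center = TRUE, scale = TRUE)
v <- eigen(cov(d))$vectors
scores <- d %*% v  # rows: flowers, columns: coordinates in PC space

# identical to prcomp()$x except for arbitrary column signs
stopifnot(max(abs(abs(scores) - abs(prcomp(iris[, 1:4], scale. = TRUE)$x))) < 1e-8)
```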

PCA in R

pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)
summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
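As a sanity check, the variances reported by prcomp() (squared standard deviations) are exactly the eigenvalues of the covariance matrix from the previous slide:

```r
# prcomp's variances equal the eigenvalues of the scaled-data covariance matrix
d <- scale(iris[, 1:4])
lambdas <- eigen(cov(d))$values
pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)
stopifnot(max(abs(pca$sdev^2 - lambdas)) < 1e-8)

round(lambdas / sum(lambdas), 4)  # 0.7296 0.2285 0.0367 0.0052
```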

PCA biplot

biplot(pca, xlabs = rep("+", nrow(iris)))

Petals and Sepals

Petal.Length and Petal.Width are highly collinear: their arrows point in nearly the same direction.

Sepal.Length and Sepal.Width are nearly uncorrelated: their arrows are close to orthogonal.

A colored PCA plot

plot(pca$x[, 1:2], col = color_for_species(iris$Species))

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA)

Principal Coordinate Analysis (PCoA), also known as metric Multi-Dimensional Scaling (mMDS), is similar to PCA, but it operates on a distance matrix rather than on the raw data table.

For example, the Euclidean distance: \(D = \sqrt{\sum{(x_i - x_j) ^ 2}}\)

Principal Coordinate Analysis (PCoA)

\(D = \sqrt{\sum{(x_i - x_j) ^ 2}}\)

head(iris, 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
# calculating by hand
sqrt(sum( (iris[1, 1:4] - iris[2, 1:4])^2 ))
[1] 0.5385165
distance <- dist(iris[, 1:4])
as.matrix(distance)[1:3, 1:3]
          1         2        3
1 0.0000000 0.5385165 0.509902
2 0.5385165 0.0000000 0.300000
3 0.5099020 0.3000000 0.000000

Principal Coordinate Analysis (PCoA)

image(as.matrix(distance), col = hcl.colors(100, "Zissou1"))

PCoA in R

pcoa <- cmdscale(distance, k = 10, eig = TRUE)
barplot(pcoa$eig[1:4] / sum(pcoa$eig))

PCoA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))

PCoA vs PCA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))
plot(pca$x[, 1:2], col = color_for_species(iris$Species))

PCoA vs PCA in R

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
pcoa <- cmdscale(distance, k = 2)
plot(pcoa, col = color_for_species(iris$Species))
plot(pca$x[, 1], -pca$x[, 2], col = color_for_species(iris$Species))
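The sign flip above is no accident: eigenvector signs are arbitrary, and with Euclidean distances PCoA reproduces the PCA scores exactly, up to those signs. A quick check on the unscaled data, where cmdscale() on dist() corresponds to prcomp() without scaling:

```r
# PCoA on Euclidean distances equals PCA scores up to column signs
pca_raw <- prcomp(iris[, 1:4])               # centered, unscaled (prcomp default)
pcoa2   <- cmdscale(dist(iris[, 1:4]), k = 2)
stopifnot(max(abs(abs(pca_raw$x[, 1:2]) - abs(pcoa2))) < 1e-6)
```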

non-metric Multi-Dimensional Scaling (nMDS)

nMDS

nMDS <- MASS::isoMDS(distance + 1e-9)  # tiny jitter: isoMDS rejects zero distances (iris has duplicate rows)
initial  value 3.025865 
iter   5 value 2.637651
final  value 2.582478 
converged
plot(nMDS$points, col = color_for_species(iris$Species))
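The printed "final value" is Kruskal's stress in percent; isoMDS() also returns it, so fit quality can be checked directly. A sketch, re-fitting silently with the same jitter:

```r
library(MASS)

# re-fit without the trace; the jitter avoids the zero distances isoMDS rejects
d <- dist(iris[, 1:4]) + 1e-9
fit <- isoMDS(d, k = 2, trace = FALSE)
fit$stress  # percent stress; values under roughly 5 are usually read as a good fit
```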

nMDS vs PCoA

par(mfrow = c(1, 2), mar = c(4, 4, 2, 2))
plot(pcoa, col = color_for_species(iris$Species))
plot(nMDS$points, col = color_for_species(iris$Species))

K-means clustering

K-means clustering: principles

Partition \(n\) observations into \(k\) clusters.

clustering <- kmeans(distance, centers = 2)  # centers is the number of clusters
names(clustering)
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
# check which cluster each species falls into
with(
  data.frame(species = iris$Species, cluster = clustering$cluster),
  tapply(cluster, species, table)
) |> unlist()
    setosa.1 versicolor.1 versicolor.2  virginica.2 
          50            1           49           50 

K-means clustering: distance, PCA, or nMDS?

Distance:

    setosa.1 versicolor.2 versicolor.3  virginica.2  virginica.3 
          50            1           49           37           13 

PCA:

    setosa.1 versicolor.2 versicolor.3  virginica.2  virginica.3 
          50           39           11           14           36 

nMDS:

    setosa.2 versicolor.1 versicolor.3  virginica.1  virginica.3 
          50            2           48           36           14 

K-means: use distances or PCA

K-means clustering: how to find \(K\)?
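One common heuristic is the elbow method: fit k-means over a range of K and look for the bend in the total within-cluster sum of squares. A sketch (not from the original slides):

```r
set.seed(1)  # k-means starts from random centers, so fix the seed
wss <- sapply(1:8, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 25)$tot.withinss
})
plot(1:8, wss, type = "b", xlab = "K (number of clusters)",
     ylab = "Total within-cluster sum of squares")
```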

Take-home messages

  • When you have multiple response variables or too many dimensions, you need multivariate analysis.
  • Multivariate analysis compresses information so that you can work with it more easily.
  • If you have raw values, use PCA.
  • If you have distances or dissimilarities, use PCoA or (better) nMDS.
  • K-means clustering: unsupervised machine learning.